perceptual data
Sora and V-JEPA Have Not Learned The Complete Real World Model -- A Philosophical Analysis of Video AIs Through the Theory of Productive Imagination
Sora from OpenAI has shown exceptional performance, yet it faces scrutiny over whether its technological prowess equates to an authentic comprehension of reality. Critics contend that it lacks a foundational grasp of the world, a deficiency V-JEPA from Meta aims to amend with its joint embedding approach. This debate is vital for steering the future direction of Artificial General Intelligence (AGI). We enrich this debate by developing a theory of productive imagination that generates a coherent world model based on Kantian philosophy. We identify three indispensable components of the coherent world model capable of genuine world understanding: representations of isolated objects, an a priori law of change across space and time, and Kantian categories. Our analysis reveals that Sora is limited because of its oversight of the a priori law of change and Kantian categories, flaws that are not rectifiable through scaling up the training. V-JEPA learns the context-dependent aspect of the a priori law of change. Yet it fails to fully comprehend Kantian categories and incorporate experience, leading us to conclude that neither system currently achieves a comprehensive world understanding. Nevertheless, each system has developed components essential to advancing an integrated AI productive imagination-understanding engine. Finally, we propose an innovative training framework for an AI productive imagination-understanding engine, centered around a joint embedding system designed to transform disordered perceptual input into a structured, coherent world model. Our philosophical analysis pinpoints critical challenges within contemporary video AI technologies and outlines a pathway toward an AI system capable of genuine world understanding, one that can be applied to reasoning and planning in the future.
Brain-inspired automated visual object discovery and detection
Chen, Lichao, Singh, Sudhir, Kailath, Thomas, Roychowdhury, Vwani
Despite significant recent progress, machine vision systems lag considerably behind their biological counterparts in performance, scalability, and robustness. A distinctive hallmark of the brain is its ability to automatically discover and model objects, at multiscale resolutions, from repeated exposures to unlabeled contextual data and then to be able to robustly detect the learned objects under various nonideal circumstances, such as partial occlusion and different view angles. Replication of such capabilities in a machine would require three key ingredients: (i) access to large-scale perceptual data of the kind that humans experience, (ii) flexible representations of objects, and (iii) an efficient unsupervised learning algorithm. The Internet fortunately provides unprecedented access to vast amounts of visual data. This paper leverages the availability of such data to develop a scalable framework for unsupervised learning of object prototypes--brain-inspired flexible, scale, and shift invariant representations of deformable objects (e.g., humans, motorcycles, cars, airplanes) comprised of parts, their different configurations and views, and their spatial relationships. Computationally, the object prototypes are represented as geometric associative networks using probabilistic constructs such as Markov random fields. We apply our framework to various datasets and show that our approach is computationally scalable and can construct accurate and operational part-aware object models much more efficiently than in much of the recent computer vision literature. We also present efficient algorithms for detection and localization in new scenes of objects and their partial views.
Learning Sensor, Space and Object Geometry
Stober, Jeremy (The University of Texas at Austin)
Robots with many sensors are capable of generating volumes of high-dimensional perceptual data. Making sense of this data and extracting useful knowledge from it is a difficult problem. For robots lacking proper models, trying to understand a stream of uninterpreted data is an especially acute problem. One critical step in linking raw uninterpreted perceptual data to cognition is dimensionality reduction. Current methods for reducing the dimension of data do not meet the demands of a robot situated in the world, and methods that use only perceptual data do not take full advantage of the interactive experience of an embodied robot agent. This work proposes a new scalable, incremental and active approach to dimensionality reduction suitable for extracting geometric knowledge from uninterpreted sensors and effectors. The proposed method uses distinctive state abstractions to organize early sensorimotor experience and sensorimotor embedding to incrementally learn accurate geometric representations based on experience. This approach is applied to the problem of learning the geometry of sensors, space, and objects. The result is evaluated using techniques from statistical shape analysis.
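As a rough illustration of the dimensionality-reduction step the abstract describes, the sketch below applies ordinary PCA (not the sensorimotor embedding method proposed in the work) to synthetic sensor readings. The data, the linear sensor response, and the helper name `pca_reduce` are all assumptions made for illustration only.

```python
import numpy as np

def pca_reduce(data, n_components):
    """Project the rows of `data` onto its top principal components."""
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions of the centered data.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    # Fraction of total variance carried by each principal direction.
    explained = singular_values**2 / np.sum(singular_values**2)
    return centered @ components.T, explained[:n_components]

# Hypothetical setup: a robot's 2-D position drives 50 uninterpreted sensors.
rng = np.random.default_rng(0)
latent = rng.uniform(-1.0, 1.0, size=(500, 2))   # assumed hidden 2-D state
mixing = rng.normal(size=(2, 50))                # assumed linear sensor response
readings = latent @ mixing + 0.01 * rng.normal(size=(500, 50))

embedding, explained = pca_reduce(readings, 2)
print(embedding.shape, float(explained.sum()))
```

Because the synthetic sensors respond linearly to a two-dimensional state, two components capture nearly all of the variance. Real sensor streams are generally nonlinear, and passive methods like this one ignore the robot's own actions, which is the limitation the abstract sets out to address.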